EDA Project

2021WI_MSDS_430-DL_SEC58 Python for Data Analysis

GROUP 2 Gaurav Garg, Melissa Gillette, Mark Stockwell

The EDA project in this course has four main parts to it:
1. Project Proposal 2. Phase 1 3. Phase 2 4. Report This notebook will be used for Project Proposal, Phase 1, and Phase 2. You will have specific questions to answer within this notebook for Project Proposal and Phase 1. You will also continue using this notebook for Phase 2. However, guidance and expectations can be found on Canvas for that assignment. The report is completed outside of this notebook (delivered as a PDF). Detailed instructions for that assignment are provided in Canvas.
Read this before proceeding: 1. Review the list of data sets and sources of data to avoid before choosing your data. This list is provided in the instructions for the Project Proposal assignment in Canvas.

2. It is expected that when you are asked questions requiring typed explanations you are to use a markdown cell to type your answers neatly. Do not provide typed answers to questions as extra comments within your code. Only provide comments within your code as you normally would, i.e. as needed to explain or remind yourself what each part of the code is doing.

Project Proposal

The intent of this assignment is for you to share your chosen data file(s) with your instructor and provide general information on your goals for the EDA project.
Step 1 (2 pts): Give a brief description of the source(s) of your data and include a direct link to your data.

Description of Dataset Source:

COVID DATA - The CSV file contains daily data on the COVID-19 pandemic for the US and individual states. All dates and times are in US eastern time (ET).

Data Source Summary: Almost all of the data compiled is taken directly from the websites of state/territory public health authorities. In cases, data is missing from these websites, it's supplemented with information from official state social media accounts or from press conferences with governors or other state authorities.

Each state has its own set of caveats, that are documented on the project site data page.

Link to Dataset:

Step 2 (2 pts): Briefly explain why you chose this data.

Over the last year it has proven to be a reliable source of information for tracking COVID related testing, infections and deaths across the US.

Step 3 (1 pt): Provide a brief overview of your goals for this project.

Goals

The overall goal of this project is to understand the relationship and timing of COVID testing, hospitalizations, ventilator use, and deaths. It is believed these statistics are related and can be used in a predictive way for hospital planning purposes.

Step 4 (1 pt): Read the data into this notebook.
Step 5 (1 pt): Inspect the data using the info( ), head( ), and tail( ) methods.
STOP HERE for your Project Proposal assignment. Submit your (1) original data file(s) along with (2) the completed notebook up to this point, and (3) the html file for grading and approval.
Instructor Feedback and Approval (3 pts): Your instructor will provide feedback in either the cell below this or via Canvas. You can expect one of the following point values for this portion. 3 pts - if your project goals and data set are both approved.
2 pts - if your data set is approved but changes to your project goals (Step 3) are needed.
1 pt - if your project goals are approved but your data set is not approved.
0 pts - if neither your data set nor your project goals are approved.

As needed, follow your instructor's feeback and guidance to get on track for the remaining portions of the EDA project.

EDA Phase 1

The overall goal of this assignment is to take all necessary steps to inspect the quality of your data and prepare the data according to your needs. For information and resources on the process of Exploratory Data Analysis (EDA), you should explore the EDA Project Resources Module in Canvas. Once you've read through the information provided in that module and have a comfortable understanding of EDA using Python, complete steps 6 through 10 listed below to satisfy the requirements for your EDA Phase 1 assignment. **Remember to convert code cells provided to markdown cells for any typed responses to questions.**
Step 6 (2 pts): Begin by elaborating in more detail from the previous assignment on why you chose this data?
1. Explain what you hope to learn from this data. 2. Do you have a hunch about what this data will reveal? (The answer to this question will be used in the Introduction section of your EDA report.)

We hope to find answers to the following questions:

1) Is there a data correlation (lead and lag indicators) that can help predict future hospitalizations. The benefit of this analysis would enable better hospital capacity planning.

2) Is there is a correlation between staff experience with the disease over time that would decrease the use of ventilators and deaths. The benefit of this analysis and correlation would stress the importance of training staff at an accelerated rate (particulary on latest treatment protocols) to ultimately decrease the death rate. This analysis would also help future pandimic response with planning by identifying certain metrics that once hit, improvements should occur as far as disease rate goes.

Our hunch is that the increase in positive test ratios correlate to future hospizalizations - regardless of absolute numbers of tests, the positive test ratio is a proxy for overall disease prevalence in the population. Our other hunch is that with greater disease experience and evolution of treatments the ratio of ventilator usage and deaths to total hospitalization will decrease. In other words, greater insight into the disease combined with effective treatment options will lower the percentage of patients needing ventilators and dying.

Step 7 (2 pts): Discuss the popluation and the sample:
1. What is the population being represented by the data you've chosen? 2. What is the total sample size?

The population represented is based on COVID related data captured across all the US states and 6 Territories (American Samoa, Virgin Islands, Puerto Rico, District of Columbia, Mariana Islands and Guam).

However, in our analysis we will focus on the key states (Top 5 States Ranked in terms of Deaths) and the State of Washington where the first COVID linked Fatality and Infection were reported in the US.

Chart - Breakdown of Deaths per State and the Count of Rows Available.

Step 8 (2 pts): Describe how the data was collected. For example, is this a random sample? Are sampling weights used with the data?
Step 9 (4 pts): In the Project Proposal assignment you used the info( ) method to inspect the variables, their data types, and the number of non-null values. Using that information as a guide, provide definitions of each of your variables and their corresponding data types, i.e. a data dictionary. Also indicate which variables will be used for your purposes.

Data Dictionary

Based on intial data set, there are 41 variables. The info() command provides detailed technical metadata, specifically the datatypes of each column (most are numeric). The dataset includes column headers with descriptive names which are generally helpful in identifying the meaning of each variable. In addition there are multiple sources of business/semantic metadata for each column:

The spreadsheet and API can be read programatically and linked to the dataset metadata to produce a consolidated data dictionary. The code below reads each source and joins them in a pandas data frame, and then exports to an excel file for easy viewing.

JSON Metadata

Spreadsheet Metadata

Create Data Dictionary

Step 10 (10 pts): For full credit in this problem you'll want to take all necessary steps to report on the quality of the data and clean the data accordingly. Some things to consider while doing this are listed below. Depending on your data and goals, there may be additional steps needed than those listed here. 1. Are there rows with missing or inconsistent values? If so, eliminate those rows from your data where appropriate. 2. Are there any outliers or duplicate rows? If so, eliminate those rows from your data where appropriate. At each stage of cleaning the data, state how many rows were eliminated. 3. Are you using all columns (variables) in the data? If not, are you eliminating those columns? 4. Consider some type of visual display such as a boxplot to determine any outliers. Do any outliers need removed? If so, how many were removed? At each stage of cleaning the data, state how many rows were eliminated. It is good practice to get the shape of the data before and after each step in cleaning the data and add typed explanations (in separate markdown cells) of the steps taken to clean the data.
Include the rest of your work below and insert cells where needed.

Clean and Filter data

Select the rows (States) to include, drop inconsistent data.

Column Selection

Not all columns are needed. We are interested in hospitalizations, ICU usage, Ventilator usage, deaths, and testing.

STOP HERE for your EDA Phase 1 assignment. Submit your cleaned data file along with the completed notebook up to this point for grading.

EDA Phase 2

All of your work for the EDA Phase 2 assignment will begin below here. Refer to the detailed instructions and expectations for this assignment in Canvas.

Look at cleaned data

Based on counts by state, following statements on the data can be made:

NOTE: Remaining analysis will focus on states with best data: MI, NY, OH

- Add positive test ratio column

This column will be needed to test hypothesis 1

Focus on NY data

Sort data ascending by date, smooth data with 7 day average

Smooth data with 7 day average

It can be seen in charts that the "choppiness" has disappeared.

- Add MI, OH to rolling mean dataset

Note that the "date" and "state" columns become the index. This makes charting easier and ensures only one date/state combination (uniqueness).

Correlations - NY data

Plot Rolling Mean Variables

Add veriage here explaining the chart above

Note in pair plots above there is strong correlation between hospitalizations, ICU, posititivity rate, and death within each state.

Hypothesis Testing

Hypothesis 1 - Positivity Rate vs. Hospitalizations

Methodology: extract state/date/positiveRatio/hospitalizedCurrently figures. Add additional columns hospitalizedCurrently at T+1 thru T+14. Measure correlation at each day and determine maximum correlation time lag.

Analysis of Testing Volume

Analysis of Positivity Rate

The correlation betweeen t1 and t2 is slightly higher than same day comparison of positive test ration and hospitalizations. Overall the correlation is weak(~0.4)

From this we can say an increase in positive test ratio will indicate slight increase in hospitalizations 2 days later.

Hypothesis 2 - Hospitalizations vs. Ventilator Usage and Death

Methodology: extract state/date/hospitalizedCurrently/onVentilatorCurrently/deathIncrease figures. Add additional columns onVentilatorCurrently, deathIncrease at T+1 thru T+14. Measure correlation at each day and determine maximum correlation time lag.

Analysis of Correlation

Analysis of Time Shifted Data

From the charts above, we can infer the following:

Hypothesis 3 - Ventilator Usage vs. Hospitalization over Time

Over time as hospital staff has more disease experience and better therapies evolve, the ratio of ventilator usage and deaths to total hospitalizations has decreased.

Analysis of Ventilator Usage over time

From the charts above it can be seen that